The aim of this assignment is to develop the complete workflow of a Machine Learning model, applying the knowledge acquired in the course. To that end I use a dataset containing information about transactions made by a number of customers. First we analyse the data, then apply the necessary transformations, and finally evaluate several models to see which one best fits the problem. The main objective is to build a model that, given a transaction, outputs the probability that it is fraudulent.
# Import the Libraries
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
import plotly.express as px
import scipy.stats as ss
import warnings
import pickle
from category_encoders import TargetEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer, MissingIndicator
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
import category_encoders as ce
# Functions
def get_corr_matrix(dataset=None, metodo='pearson', size_figure=[10, 8]):
    # To obtain the Spearman correlation, just change the method to 'spearman'
    if dataset is None:
        print('\nYou must pass a dataset to the function')
        return 1
    sns.set(style="white")
    # Compute the correlation matrix
    corr = dataset.corr(method=metodo)
    # Set self-correlation to zero to avoid distraction
    for i in range(corr.shape[0]):
        corr.iloc[i, i] = 0
    # Set up the matplotlib figure
    f, ax = plt.subplots(figsize=size_figure)
    # Draw the heatmap with the correct aspect ratio
    sns.heatmap(corr, center=0,
                square=True, linewidths=.5, cmap='viridis')  # cbar_kws={"shrink": .5}
    plt.show()
    return 0
def cramers_v(confusion_matrix):
    """
    Calculate Cramér's V statistic for categorical-categorical association,
    using the bias correction from Bergsma and Wicher,
    Journal of the Korean Statistical Society 42 (2013): 323-328.
    confusion_matrix: contingency table built with pd.crosstab()
    (pass its .values so that .sum() totals every cell)
    """
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    return np.sqrt(phi2corr / min((kcorr - 1), (rcorr - 1)))
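As a quick sanity check of the statistic, a minimal self-contained sketch (the function body is repeated here so the example runs on its own): a variable crossed with itself should score close to 1, while an independent pair should score near 0.

```python
import numpy as np
import pandas as pd
import scipy.stats as ss

def cramers_v(confusion_matrix):
    # Bias-corrected Cramér's V, repeated so this check is self-contained.
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    return np.sqrt(phi2corr / min((kcorr - 1), (rcorr - 1)))

x = pd.Series(['a'] * 50 + ['b'] * 50)

# A variable crossed with itself: association should be close to 1.
v_perfect = cramers_v(pd.crosstab(x, x).values)

# An alternating variable independent of x: association should be ~0.
y = pd.Series(['u', 'w'] * 50)
v_indep = cramers_v(pd.crosstab(x, y).values)
```

Note the bias correction (and scipy's Yates continuity correction for 2x2 tables) keeps the "perfect" value slightly below 1.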
# Read data
datos = pd.read_csv("./datos/Copia de Original_dataset_payments_fraud.csv", sep=';')
print("The dataset has", len(datos.index), "rows and", len(datos.columns), "columns")
datos.head()
The dataset has 1048575 rows and 19 columns
| step | type | amount | gender | device | connection_time | nameOrig | race | oldbalanceOrg | age | newbalanceOrig | zone | user_number | nameDest | user_connections | security_alert | oldbalanceDest | newbalanceDest | isFraud | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | PAYMENT | 9839.64 | man | mac | 0,140039412 | C1231006815 | black | 170136.0 | 85 | 160296.36 | capital | 138 | M1979787155 | 5 | 1 | 0.0 | 0.0 | 0 |
| 1 | 1 | PAYMENT | 1864.28 | woman | mac | 0,496889534 | C1666544295 | asian | 21249.0 | 57 | 19384.72 | country | 909 | M2044282225 | 1 | 0 | 0.0 | 0.0 | 0 |
| 2 | 1 | TRANSFER | 181.00 | man | pc | 0,781150327 | C1305486145 | asian | 181.0 | 66 | 0.00 | capital | 2569 | C553264065 | 10 | 0 | 0.0 | 0.0 | 1 |
| 3 | 1 | CASH_OUT | 181.00 | man | mac | 0,565068378 | C840083671 | black | 181.0 | 31 | 0.00 | country | 1787 | C38997010 | 3 | 0 | 21182.0 | 0.0 | 1 |
| 4 | 1 | PAYMENT | 11668.14 | unknow | mac | 0,517114493 | C2048537720 | black | 41554.0 | 90 | 29885.86 | country | 3997 | M1230701703 | 8 | 0 | 0.0 | 0.0 | 0 |
# convert connection time variable to float and change commas to dots
datos['connection_time'] = datos['connection_time'].str.replace(',','.').astype(np.float64)
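An alternative worth noting: pandas can parse decimal commas at load time through the `decimal` parameter of `read_csv`. This only applies when every numeric column uses the comma convention; here only connection_time does, which is presumably why the conversion is done after loading. A sketch on a synthetic fragment (column names borrowed from the dataset):

```python
import io
import pandas as pd

# A synthetic fragment with ';' separators and decimal commas in every
# numeric column.
raw = "amount;connection_time\n9839,64;0,140039412\n1864,28;0,496889534\n"

# decimal=',' makes pandas parse commas as decimal points directly,
# avoiding the str.replace(...).astype(...) round trip.
df = pd.read_csv(io.StringIO(raw), sep=';', decimal=',')
```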
# drop several variables: race and gender, because they should not be used to determine whether a transaction
# is fraudulent; nameOrig and nameDest, because they are mere identifiers that provide no information
# and are not relevant for predicting fraud.
datos = datos.drop(['gender', 'race', 'nameOrig','nameDest'], axis=1)
datos.head()
| step | type | amount | device | connection_time | oldbalanceOrg | age | newbalanceOrig | zone | user_number | user_connections | security_alert | oldbalanceDest | newbalanceDest | isFraud | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | PAYMENT | 9839.64 | mac | 0.140039 | 170136.0 | 85 | 160296.36 | capital | 138 | 5 | 1 | 0.0 | 0.0 | 0 |
| 1 | 1 | PAYMENT | 1864.28 | mac | 0.496890 | 21249.0 | 57 | 19384.72 | country | 909 | 1 | 0 | 0.0 | 0.0 | 0 |
| 2 | 1 | TRANSFER | 181.00 | pc | 0.781150 | 181.0 | 66 | 0.00 | capital | 2569 | 10 | 0 | 0.0 | 0.0 | 1 |
| 3 | 1 | CASH_OUT | 181.00 | mac | 0.565068 | 181.0 | 31 | 0.00 | country | 1787 | 3 | 0 | 21182.0 | 0.0 | 1 |
| 4 | 1 | PAYMENT | 11668.14 | mac | 0.517114 | 41554.0 | 90 | 29885.86 | country | 3997 | 8 | 0 | 0.0 | 0.0 | 0 |
# Dimension with and without duplicates
print(datos.shape, datos.drop_duplicates().shape)
(1048575, 15) (1048575, 15)
There are no duplicates.
# Types of variables
print(datos.dtypes.sort_values().to_frame('feature_type').groupby(by = 'feature_type').size().to_frame('count').reset_index())
datos.dtypes
  feature_type  count
0        int64      6
1      float64      6
2       object      3
step                  int64
type                 object
amount              float64
device               object
connection_time     float64
oldbalanceOrg       float64
age                   int64
newbalanceOrig      float64
zone                 object
user_number           int64
user_connections      int64
security_alert        int64
oldbalanceDest      float64
newbalanceDest      float64
isFraud               int64
dtype: object
Our target variable is 'isFraud'. It takes two possible values: 1 if the transaction is fraudulent and 0 if it is not.
# Distribution of the target variable values
pd_plot_isFraud = datos['isFraud']\
.value_counts(normalize=True)\
.mul(100).rename('percent').reset_index()
pd_plot_isFraud_conteo = datos['isFraud'].value_counts().reset_index()
pd_plot_isFraud_pc = pd.merge(pd_plot_isFraud, pd_plot_isFraud_conteo, on=['index'], how='inner')
print(pd_plot_isFraud_pc)
fig = px.histogram(pd_plot_isFraud_pc, x="index", y=['percent'])
fig.show()
   index   percent  isFraud
0      0  99.89109  1047433
1      1   0.10891     1142
We can see that the dataset is heavily imbalanced in favour of class 0: only 0.11% of the rows are fraudulent.
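Given this imbalance, most classifiers benefit from weighting the minority class. As a sketch, these are the weights that scikit-learn's `class_weight='balanced'` heuristic would derive from the counts above, using its formula n_samples / (n_classes * count_per_class):

```python
# Class counts taken from the value_counts output above.
counts = {0: 1047433, 1: 1142}
n = sum(counts.values())

# 'balanced' weighting: rare classes get proportionally larger weights.
weights = {c: n / (len(counts) * cnt) for c, cnt in counts.items()}
```

Each fraudulent row would count roughly 459 times as much as a legitimate one during training.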
pd_series_null_columns = datos.isnull().sum().sort_values(ascending=False)
pd_series_null_rows = datos.isnull().sum(axis=1).sort_values(ascending=False)
print(pd_series_null_columns.shape, pd_series_null_rows.shape)
(15,) (1048575,)
# null check
pd_null_columnas = pd.DataFrame(pd_series_null_columns, columns=['nulos_columnas'])
pd_null_filas = pd.DataFrame(pd_series_null_rows, columns=['nulos_filas'])
pd_null_filas['target'] = datos['isFraud'].copy()
pd_null_columnas['porcentaje_columnas'] = pd_null_columnas['nulos_columnas']/datos.shape[0]
pd_null_filas['porcentaje_filas']= pd_null_filas['nulos_filas']/datos.shape[1]
# null by columns
pd_null_columnas
| nulos_columnas | porcentaje_columnas | |
|---|---|---|
| device | 104580 | 0.099735 |
| zone | 104414 | 0.099577 |
| step | 0 | 0.000000 |
| type | 0 | 0.000000 |
| amount | 0 | 0.000000 |
| connection_time | 0 | 0.000000 |
| oldbalanceOrg | 0 | 0.000000 |
| age | 0 | 0.000000 |
| newbalanceOrig | 0 | 0.000000 |
| user_number | 0 | 0.000000 |
| user_connections | 0 | 0.000000 |
| security_alert | 0 | 0.000000 |
| oldbalanceDest | 0 | 0.000000 |
| newbalanceDest | 0 | 0.000000 |
| isFraud | 0 | 0.000000 |
Nulls per column stay below 10%, which is a reasonable amount, so I keep all the columns.
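Since the columns are kept and will be imputed later, it can also be useful to preserve the fact that a value was missing. `MissingIndicator` (imported at the top of the notebook, though unused so far) does exactly that; a minimal sketch on a toy frame with the same two columns:

```python
import numpy as np
import pandas as pd
from sklearn.impute import MissingIndicator

# Toy frame mimicking the two columns that have ~10% nulls.
df = pd.DataFrame({'device': ['mac', np.nan, 'pc', 'mac'],
                   'zone': ['capital', 'country', np.nan, 'country']})

# One boolean flag per column with missing values, so the fact that a
# value was absent survives any later imputation.
flags = MissingIndicator().fit_transform(df)
```

The resulting flags could be appended as extra features alongside the imputed columns.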
# row nulls
pd_null_filas
| nulos_filas | target | porcentaje_filas | |
|---|---|---|---|
| 676465 | 2 | 0 | 0.133333 |
| 28596 | 2 | 0 | 0.133333 |
| 672172 | 2 | 0 | 0.133333 |
| 1025371 | 2 | 0 | 0.133333 |
| 965223 | 2 | 0 | 0.133333 |
| ... | ... | ... | ... |
| 386252 | 0 | 0 | 0.000000 |
| 386254 | 0 | 0 | 0.000000 |
| 386255 | 0 | 0 | 0.000000 |
| 386256 | 0 | 0 | 0.000000 |
| 1048574 | 0 | 0 | 0.000000 |
1048575 rows × 3 columns
No row has more than 20% of nulls, so I keep all of them.
# Null analysis by columns. Overview
nulos_col = pd.merge(datos.isnull().sum().sort_values().to_frame('missing_value').reset_index(),
datos.dtypes.to_frame('feature_type').reset_index(),
on = 'index',
how = 'inner')
nulos_col['columns_percentage'] = nulos_col['missing_value']/datos.shape[0]
nulos_col.sort_values(['missing_value', 'feature_type', 'columns_percentage'], ascending=False)
| index | missing_value | feature_type | columns_percentage | |
|---|---|---|---|---|
| 14 | device | 104580 | object | 0.099735 |
| 13 | zone | 104414 | object | 0.099577 |
| 1 | type | 0 | object | 0.000000 |
| 2 | amount | 0 | float64 | 0.000000 |
| 3 | connection_time | 0 | float64 | 0.000000 |
| 4 | oldbalanceOrg | 0 | float64 | 0.000000 |
| 6 | newbalanceOrig | 0 | float64 | 0.000000 |
| 10 | oldbalanceDest | 0 | float64 | 0.000000 |
| 11 | newbalanceDest | 0 | float64 | 0.000000 |
| 0 | step | 0 | int64 | 0.000000 |
| 5 | age | 0 | int64 | 0.000000 |
| 7 | user_number | 0 | int64 | 0.000000 |
| 8 | user_connections | 0 | int64 | 0.000000 |
| 9 | security_alert | 0 | int64 | 0.000000 |
| 12 | isFraud | 0 | int64 | 0.000000 |
This summary lets me see which variables are categorical and which are numerical, and whether they contain null values.
Having analysed each variable in the table and understood its description and meaning, we determine the categorical and numerical variables.
# create the lists with the categorical and numerical variables
categoricas = list(datos.select_dtypes(include=['object'], exclude=np.number).columns)
numericas = list(datos.select_dtypes(exclude=['object'], include=np.number).columns)
print(categoricas, numericas)
['type', 'device', 'zone'] ['step', 'amount', 'connection_time', 'oldbalanceOrg', 'age', 'newbalanceOrig', 'user_number', 'user_connections', 'security_alert', 'oldbalanceDest', 'newbalanceDest', 'isFraud']
datos[categoricas].head()
| type | device | zone | |
|---|---|---|---|
| 0 | PAYMENT | mac | capital |
| 1 | PAYMENT | mac | country |
| 2 | TRANSFER | pc | capital |
| 3 | CASH_OUT | mac | country |
| 4 | PAYMENT | mac | country |
datos[numericas].head()
| step | amount | connection_time | oldbalanceOrg | age | newbalanceOrig | user_number | user_connections | security_alert | oldbalanceDest | newbalanceDest | isFraud | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 9839.64 | 0.140039 | 170136.0 | 85 | 160296.36 | 138 | 5 | 1 | 0.0 | 0.0 | 0 |
| 1 | 1 | 1864.28 | 0.496890 | 21249.0 | 57 | 19384.72 | 909 | 1 | 0 | 0.0 | 0.0 | 0 |
| 2 | 1 | 181.00 | 0.781150 | 181.0 | 66 | 0.00 | 2569 | 10 | 0 | 0.0 | 0.0 | 1 |
| 3 | 1 | 181.00 | 0.565068 | 181.0 | 31 | 0.00 | 1787 | 3 | 0 | 21182.0 | 0.0 | 1 |
| 4 | 1 | 11668.14 | 0.517114 | 41554.0 | 90 | 29885.86 | 3997 | 8 | 0 | 0.0 | 0.0 | 0 |
Distribution of some variables
# type
plt.figure(figsize=(15,5))
datos.groupby('type')['isFraud'].count().plot(kind="bar")
<AxesSubplot:xlabel='type'>
# device
plt.figure(figsize=(15,5))
datos.groupby('device')['isFraud'].count().plot(kind="bar")
<AxesSubplot:xlabel='device'>
# zone
plt.figure(figsize=(15,5))
datos.groupby('zone')['isFraud'].count().plot(kind="bar")
<AxesSubplot:xlabel='zone'>
Looking at the charts, most transactions come from the country zone and are made from a mac, and CASH_OUT and PAYMENT are the most frequent types. Note that these bars count all transactions, fraudulent or not, so on their own they do not identify the profiles that commit the most fraud.
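To separate frequency from risk, one can compare the fraud *rate* per group instead of raw counts: the mean of a 0/1 target within each group is exactly the proportion of fraud in that group. A sketch on a toy frame that reuses the dataset's column names (the data itself is made up):

```python
import pandas as pd

# Toy data with the same column names as the dataset.
df = pd.DataFrame({'type': ['PAYMENT', 'TRANSFER', 'CASH_OUT', 'PAYMENT', 'TRANSFER'],
                   'isFraud': [0, 1, 1, 0, 0]})

# mean() of a 0/1 target per group is the fraud rate for that group.
fraud_rate = df.groupby('type')['isFraud'].mean().sort_values(ascending=False)
```

On the real data, `datos.groupby('type')['isFraud'].mean()` would give the per-type fraud rates directly.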
# correlations between numerical variables
get_corr_matrix(dataset = datos[numericas],
metodo='pearson', size_figure=[10,8])
0
# correlations
corr = datos[numericas].corr('pearson')
new_corr = corr.abs()
new_corr.loc[:,:] = np.tril(new_corr, k=-1)
new_corr = new_corr.stack().to_frame('correlation').reset_index().sort_values(by='correlation', ascending=False)
new_corr[new_corr['correlation']>0.3]
| level_0 | level_1 | correlation | |
|---|---|---|---|
| 63 | newbalanceOrig | oldbalanceOrg | 0.999050 |
| 129 | newbalanceDest | oldbalanceDest | 0.978401 |
| 98 | security_alert | connection_time | 0.520283 |
| 121 | newbalanceDest | amount | 0.311942 |
The correlation matrix between numerical variables shows how the features relate to one another. The strongest pairs are newbalanceOrig with oldbalanceOrg and newbalanceDest with oldbalanceDest, which is expected: oldbalanceOrg and newbalanceOrig are the origin account's balance before and after the transaction, and oldbalanceDest and newbalanceDest are the destination account's balance before and after it.
# Include the target variable to see how it relates to the categorical variables.
if 'isFraud' not in categoricas:
categoricas.append('isFraud')
datos[categoricas].head()
| type | device | zone | isFraud | |
|---|---|---|---|---|
| 0 | PAYMENT | mac | capital | 0 |
| 1 | PAYMENT | mac | country | 0 |
| 2 | TRANSFER | pc | capital | 1 |
| 3 | CASH_OUT | mac | country | 1 |
| 4 | PAYMENT | mac | country | 0 |
I will now show the contingency table (pd.crosstab) of each categorical variable against the target.
confusion_matrix = pd.crosstab(datos["isFraud"], datos["type"])
print(confusion_matrix)
cramers_v(confusion_matrix.values)
type     CASH_IN  CASH_OUT  DEBIT  PAYMENT  TRANSFER
isFraud
0         227130    373063   7178   353873     86189
1              0       578      0        0       564
0.053888476136022816
We can see that all frauds occur in CASH_OUT and TRANSFER transactions.
confusion_matrix = pd.crosstab(datos["isFraud"], datos["device"])
print(confusion_matrix)
cramers_v(confusion_matrix.values)
device   iphone     mac      pc
isFraud
0        261767  366677  314528
1           295     389     339
0.0
In absolute terms most frauds happen on mac or pc, but this mirrors overall device usage: the Cramér's V of 0 indicates that device and fraud are essentially independent.
confusion_matrix = pd.crosstab(datos["isFraud"], datos["zone"])
print(confusion_matrix)
cramers_v(confusion_matrix.values)
zone     africa  capital  country
isFraud
0        313670   261845   367617
1           356      286      387
0.0
Here most transactions, fraudulent or not, come from country and africa, but again the Cramér's V of 0 shows no association between zone and fraud.
# sanity check: a variable crossed with itself should give a Cramér's V close to 1
confusion_matrix = pd.crosstab(datos["isFraud"], datos["isFraud"])
cramers_v(confusion_matrix.values)
0.9995616938531262
# Defining the steps in the numerical pipeline
numeric_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())])
# Defining the steps in the categorical pipeline
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('onehot', OneHotEncoder(handle_unknown='ignore'))])
# Numerical features to pass down the numerical pipeline
numeric_features = datos.select_dtypes(include=['int64', 'float64']).drop(['isFraud'], axis=1).columns
# Categorical features to pass down the categorical pipeline
categorical_features = datos.select_dtypes(include=['object']).columns
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)])
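A ColumnTransformer like the one above is typically chained with an estimator in a single Pipeline, so imputation, scaling and encoding are fitted on the training split only. A minimal sketch on synthetic data; the LogisticRegression here is an arbitrary placeholder, not the model chosen later in the notebook:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny synthetic frame mirroring the dataset's mix of numeric and
# categorical columns, with a missing value in each.
X = pd.DataFrame({'amount': [100.0, 250.0, np.nan, 80.0],
                  'type': ['PAYMENT', 'TRANSFER', 'CASH_OUT', np.nan]})
y = pd.Series([0, 1, 1, 0])

preprocessor = ColumnTransformer(transformers=[
    ('num', Pipeline([('imputer', SimpleImputer(strategy='median')),
                      ('scaler', StandardScaler())]), ['amount']),
    ('cat', Pipeline([('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
                      ('onehot', OneHotEncoder(handle_unknown='ignore'))]), ['type'])])

# Chaining preprocessing and model keeps every fitted statistic
# (medians, means, categories) tied to the data it was fitted on.
clf = Pipeline([('preprocessor', preprocessor), ('model', LogisticRegression())])
clf.fit(X, y)
proba = clf.predict_proba(X)[:, 1]
```

With this layout, `clf.fit(xtrain, ytrain)` fits everything at once and `clf.predict_proba(xtest)` applies the same learned transformations to unseen data.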
# save preprocessor
with open('./datos/preprocessor.pickle', 'wb') as f:
pickle.dump(preprocessor, f)
xtrain, xtest, ytrain, ytest = train_test_split(datos.drop('isFraud',axis=1),
datos['isFraud'],
stratify=datos['isFraud'],
test_size=0.2)
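The stratify argument guarantees that both splits keep the same fraud rate. A quick check on synthetic imbalanced labels (on the real data one could simply compare ytrain.mean() with ytest.mean()):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 1% positives, mimicking the rare fraud class.
y = np.array([1] * 10 + [0] * 990)
X = np.arange(len(y)).reshape(-1, 1)

xtr, xte, ytr, yte = train_test_split(X, y, stratify=y,
                                      test_size=0.2, random_state=0)

# With stratification, both splits keep the same positive rate.
train_rate, test_rate = ytr.mean(), yte.mean()
```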
We check that the train and test splits have similar distributions.
# Train
pd.concat([xtrain, pd.DataFrame(ytrain)], axis=1).describe().round(3)
| step | amount | connection_time | oldbalanceOrg | age | newbalanceOrig | user_number | user_connections | security_alert | oldbalanceDest | newbalanceDest | isFraud | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 838860.000 | 8.388600e+05 | 838860.000 | 8.388600e+05 | 838860.000 | 8.388600e+05 | 838860.000 | 838860.000 | 838860.0 | 8.388600e+05 | 8.388600e+05 | 838860.000 |
| mean | 26.971 | 1.585422e+05 | 0.500 | 8.744421e+05 | 52.479 | 8.943020e+05 | 2529.789 | 5.504 | 0.1 | 9.783038e+05 | 1.113724e+06 | 0.001 |
| std | 15.627 | 2.637555e+05 | 0.289 | 2.970275e+06 | 27.730 | 3.006908e+06 | 1426.991 | 2.873 | 0.3 | 2.299382e+06 | 2.418195e+06 | 0.033 |
| min | 1.000 | 1.000000e-01 | 0.000 | 0.000000e+00 | 5.000 | 0.000000e+00 | 59.000 | 1.000 | 0.0 | 0.000000e+00 | 0.000000e+00 | 0.000 |
| 25% | 15.000 | 1.215914e+04 | 0.250 | 0.000000e+00 | 28.000 | 0.000000e+00 | 1292.000 | 3.000 | 0.0 | 0.000000e+00 | 0.000000e+00 | 0.000 |
| 50% | 20.000 | 7.651737e+04 | 0.501 | 1.598700e+04 | 52.000 | 0.000000e+00 | 2531.000 | 6.000 | 0.0 | 1.269864e+05 | 2.185205e+05 | 0.000 |
| 75% | 39.000 | 2.137878e+05 | 0.750 | 1.366074e+05 | 76.000 | 1.745716e+05 | 3765.000 | 8.000 | 0.0 | 9.160216e+05 | 1.149069e+06 | 0.000 |
| max | 95.000 | 1.000000e+07 | 1.000 | 3.893942e+07 | 100.000 | 3.894623e+07 | 5000.000 | 10.000 | 1.0 | 4.205466e+07 | 4.216916e+07 | 1.000 |
# Test
pd.concat([xtest, pd.DataFrame(ytest)], axis=1).describe().round(3)
| step | amount | connection_time | oldbalanceOrg | age | newbalanceOrig | user_number | user_connections | security_alert | oldbalanceDest | newbalanceDest | isFraud | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 209715.000 | 2.097150e+05 | 209715.000 | 2.097150e+05 | 209715.000 | 2.097150e+05 | 209715.000 | 209715.000 | 209715.000 | 2.097150e+05 | 2.097150e+05 | 209715.000 |
| mean | 26.946 | 1.591663e+05 | 0.500 | 8.722593e+05 | 52.366 | 8.918164e+05 | 2533.044 | 5.500 | 0.099 | 9.775845e+05 | 1.116070e+06 | 0.001 |
| std | 15.607 | 2.696307e+05 | 0.288 | 2.977526e+06 | 27.710 | 3.013597e+06 | 1424.122 | 2.871 | 0.299 | 2.286345e+06 | 2.409988e+06 | 0.033 |
| min | 1.000 | 2.000000e-01 | 0.000 | 0.000000e+00 | 5.000 | 0.000000e+00 | 59.000 | 1.000 | 0.000 | 0.000000e+00 | 0.000000e+00 | 0.000 |
| 25% | 15.000 | 1.211678e+04 | 0.251 | 0.000000e+00 | 28.000 | 0.000000e+00 | 1306.000 | 3.000 | 0.000 | 0.000000e+00 | 0.000000e+00 | 0.000 |
| 50% | 20.000 | 7.568435e+04 | 0.499 | 1.605800e+04 | 52.000 | 0.000000e+00 | 2535.000 | 6.000 | 0.000 | 1.238940e+05 | 2.173152e+05 | 0.000 |
| 75% | 39.000 | 2.136442e+05 | 0.750 | 1.367684e+05 | 76.000 | 1.747509e+05 | 3762.000 | 8.000 | 0.000 | 9.154374e+05 | 1.153266e+06 | 0.000 |
| max | 95.000 | 1.000000e+07 | 1.000 | 3.856340e+07 | 100.000 | 3.893942e+07 | 5000.000 | 10.000 | 1.000 | 4.130394e+07 | 4.205466e+07 | 1.000 |
I decided not to apply any feature-selection technique. We already removed the variables we could not use, the remaining feature set is small, and for most of the models considered the presence of correlated or weakly predictive variables poses no problem beyond some extra computational cost.
# Save the dataframes
xtrain.to_parquet("./datos/xtrain.parquet")
xtest.to_parquet("./datos/xtest.parquet")
pd.DataFrame(ytrain).to_parquet("./datos/ytrain.parquet")
pd.DataFrame(ytest).to_parquet("./datos/ytest.parquet")